Recovering Structure from Unstructured Web-accessible Classified Advertisements

Authors

  • Richard Cole
  • Peter Eklund
Abstract

This paper describes a research prototype system called RFCA for structuring Web-accessible rental classified advertisements based on semantic content. A hand-crafted parser is used to extract various facets of the rental property being advertised, including, among others, number of rooms, type of garage, dwelling type (unit, house, or high-rise apartment), price and contact details. The performance of the parser is measured in terms of precision and recall by comparing its output to that of a human expert. The structured information, once extracted, is stored in a relational database, and users searching for rental properties are presented with a graphical organisation of rental properties according to predefined themes. The overall result is a suite of tools for extracting, cleaning, structuring, and visually querying/browsing collections of Web-accessible rental advertisements. The mathematical and methodological foundation for the graphical organisation of the structured information is provided by formal concept analysis. Using formal concept analysis, each property is understood to be an object possessing attributes with attribute values. The data is then conceptually organised via concept lattices, dynamically, according to predefined conceptual scales. The concept lattice organises rental properties into conceptual groupings. The user then has the opportunity to view the attributes of all properties in a grouping as well as navigate back to the source advertisements. The interface is delivered over the Web using a CGI interface and dynamic creation of images and image maps. The ideas presented are general enough to be relevant to other Web-accessible unstructured text sources.

Proceedings of the Fifth Australasian Document Computing Symposium, Sunshine Coast, Australia, December 1, 2000.

1 Rental Formal Concept Analysis

Many newspapers hold large keyword-indexed free-text collections of classified advertising on the Web, and these can be searched on-line.
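The abstract's view of each rental property as an object possessing attributes, grouped by a concept lattice, can be illustrated with a minimal sketch. The toy context below (the advertisement names and attributes) is hypothetical, and the brute-force enumeration is only one simple way to compute formal concepts; it is not the RFCA implementation, which the source does not show.

```python
from itertools import combinations

# Hypothetical toy context: rental properties (objects) and their attributes.
CONTEXT = {
    "ad1": {"house", "garage", "3 bedrooms"},
    "ad2": {"unit", "garage"},
    "ad3": {"house", "3 bedrooms"},
    "ad4": {"unit"},
}


def common_attributes(objects, context):
    """Attributes shared by every object in `objects`.

    For the empty object set this is the full attribute set.
    """
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return shared


def objects_having(attributes, context):
    """Objects that possess every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Exponential in the number of objects, so suitable for toy examples only.
    """
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset, context)
            extent = objects_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


if __name__ == "__main__":
    # Print the groupings from the largest extent to the smallest.
    for extent, intent in sorted(formal_concepts(CONTEXT),
                                 key=lambda c: (-len(c[0]), sorted(c[0]))):
        print(sorted(extent), sorted(intent))
```

Each printed pair is one conceptual grouping: the set of properties (extent) together with exactly the attributes they all share (intent). Ordering these pairs by extent inclusion gives the concept lattice a user would browse.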
The intention with this work is to demonstrate how value can be added to such data by extraction, cleaning and structuring. A structured database derived from free-text classifieds can then be browsed effectively. We argue that a browsing interface that structures the presentation of classifieds, related to a particular purpose (such as rental classifieds), can facilitate retrieval. More specifically, we maintain that a browsing interface using formal concept analysis gives a sense of the way that attributes within the free-text sources are distributed across the text collection, something that a keyword-based search interface cannot do. Formal Concept Analysis (FCA) [13] has been developed during the last twenty years and successfully applied to data analysis and knowledge processing [15]. The mathematics of FCA has been carefully described in Ganter and Wille [9], and the basic details of the theory are omitted here for brevity. There have been a number of examples of FCA applied to information retrieval and filtering [10, 1], including our own work [3, 4]. In these systems the main difficulty is attribute identification from texts. In the medical and email domains in which we have worked [8, 2, 5], objects are easily identified since they correspond to documents, each typically stored as a single file. In the case of Web-accessible rental classifieds the extraction task is complicated by object recognition: several rental classifieds often appear in the same advertisement and are grouped by location, price or the number of bedrooms. An example of such a problematic advertisement is illustrated in Fig. 1.

2 Object and Attribute Identifica-
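The facet extraction and the precision/recall evaluation described in the abstract can be illustrated with a minimal sketch. The regular expressions, facet names, and sample advertisement text below are all hypothetical stand-ins; the paper's hand-crafted parser is not reproduced in this excerpt, and this sketch does not attempt the harder problem of separating several properties within one advertisement.

```python
import re

# Hypothetical patterns, for illustration only.
PRICE_RE = re.compile(r"\$(\d+)")
BEDROOM_RE = re.compile(r"(\d+)\s*(?:br|bed|bedroom)s?\b", re.IGNORECASE)
DWELLING_RE = re.compile(r"\b(unit|house|apartment)\b", re.IGNORECASE)


def extract_facets(ad_text):
    """Return the set of (facet, value) pairs found in one advertisement."""
    facets = set()
    price = PRICE_RE.search(ad_text)
    if price:
        facets.add(("price", int(price.group(1))))
    bedrooms = BEDROOM_RE.search(ad_text)
    if bedrooms:
        facets.add(("bedrooms", int(bedrooms.group(1))))
    dwelling = DWELLING_RE.search(ad_text)
    if dwelling:
        facets.add(("dwelling", dwelling.group(1).lower()))
    return facets


def precision_recall(predicted, expected):
    """Compare parser output with a human expert's labels.

    precision = correct predictions / all predictions
    recall    = correct predictions / all expert labels
    """
    correct = len(predicted & expected)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    ad = "2 br unit, lock-up garage, $150 pw"  # hypothetical advertisement
    expert = {("bedrooms", 2), ("dwelling", "unit"),
              ("price", 150), ("garage", "lock-up")}
    print(precision_recall(extract_facets(ad), expert))
```

In this example the sketch has no garage pattern, so the expert's garage label is missed: every extracted facet is correct, but recall falls below one.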


Similar resources

Browsing Semi-structured Web Texts Using Formal Concept Analysis

Query-directed browsing of unstructured Web-texts using Formal Concept Analysis (FCA) confronts two problems. Firstly on-line Web-data is sometimes unstructured and any FCA-system must include additional mechanisms to structure input sources. Secondly many online collections are large and dynamic so a Web-robot must be used to automatically extract data. These issues are addressed in this paper...


eDEW: Effective Data Extraction from Web

The Internet has become the most popular place for accessing the World Wide Web (WWW). With the rapidly growing amount of information on the Internet, accurate and efficient web data extraction has become necessary. Nevertheless, there are various kinds of web pages, which contain structured, semi-structured and unstructured data. A web page is a formation of many information blocks. Besides an informative...


Browsing Semi-structured Texts on the Web using Formal Concept Analysis

Browsing unstructured Web-texts using Formal Concept Analysis (FCA) confronts two problems. Firstly, on-line Web-data is sometimes unstructured and any FCA-system must include additional mechanisms to discover the structure of input sources. Secondly, many on-line collections are large and dynamic so a Web-robot must be used to automatically extract data when it is required. These issues are add...


A Conceptual-Modeling Approach to Extracting Data from the Web

Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich documents (e.g., advertisements, movie rev...


Georeferencing Semi-Structured Place-Based Web Resources Using Machine Learning

In recent years, shared content on the web has grown significantly. A great part of this information is publicly available in the form of semi-structured data. Moreover, a significant amount of this information is related to place. Such information refers to a location on the earth but does not contain any explicit coordinates. In this research, we tried to georefer...



Publication date: 2000